Quantitative analysis of treebanks using frequent subtree mining methods
Abstract
The first task of statistical computational linguistics, or any other type of data-driven processing of language, is the extraction of counts and distributions of phenomena. This is much more difficult for the type of complex structured data found in treebanks and in corpora with sophisticated annotation than for tokenized texts. Recent developments in data mining, particularly in the extraction of frequent subtrees from treebanks, offer some solutions. We have applied a modified version of the TreeMiner algorithm to a small treebank and present some promising results.
Similar resources
PrefixTreeESpan: A Pattern Growth Algorithm for Mining Embedded Subtrees
Frequent embedded subtree pattern mining is an important data mining problem with broad applications. In this paper, we propose a novel embedded subtree mining algorithm, called PrefixTreeESpan (i.e. Prefix-Tree-projected Embedded-Subtree pattern), which finds a subtree pattern by growing a frequent prefix-tree. Thus, using divide and conquer, mining local length-1 frequent subtree patterns in P...
Frequent Subtree Mining - An Overview
Mining frequent subtrees from databases of labeled trees is a new research field that has many practical applications in areas such as computer networks, Web mining, bioinformatics, XML document mining, etc. These applications share a requirement for the more expressive power of labeled trees to capture the complex relations among data entities. Although frequent subtree mining is a more diffic...
EvoMiner: Frequent Subtree Mining in Phylogenetic Databases Technical Report #11-08, Dept. of Computer Science, Iowa State University
The problem of mining collections of trees to identify common patterns, called frequent subtrees (FSTs), arises often when trying to make sense of the results of phylogenetic analysis. FST mining generalizes the well-known maximum agreement subtree problem. Here we present EvoMiner, a new algorithm for mining frequent subtrees in collections of phylogenetic trees. EvoMiner is an Apriori-like le...
PCITMiner- Prefix-based Closed Induced Tree Miner for finding closed induced frequent subtrees
Frequent subtree mining has attracted a great deal of interest among researchers due to its application in a wide variety of domains. Some of the domains include bioinformatics, XML processing, computational linguistics, and web usage mining. Despite the advances in frequent subtree mining, mining the entire set of frequent subtrees is infeasible due to the combinatorial explosion of the freq...
Clustering XML Documents Using Closed Frequent Subtrees: A Structural Similarity Approach
This paper presents the experimental study conducted over the INEX 2007 Document Mining Challenge corpus employing a frequent subtree-based incremental clustering approach. Using the structural information of the XML documents, the closed frequent subtrees are generated. A matrix is then developed representing the closed frequent subtree distribution in documents. This matrix is used to progres...